Speech annotation and corpus tools

Authors

  • Steven Bird
  • Jonathan Harrington
Abstract

Over the last 10 years, the growth in the use of speech corpora has benefited from the establishment of data centres, such as the Linguistic Data Consortium (LDC), the European Language Resources Association (ELRA), and the Japanese Language Resource Consortium (GSK: Gengo Shigen Kyouyuukikou), and from multi-site annotation initiatives, such as the ToBI system for prosodic annotation and the DAMSL system of discourse annotation. Today hundreds of annotated speech corpora exist and are used worldwide, and the demand for richly annotated corpora is growing.

Related resources

Annotation of a Multichannel Noisy Speech Corpus

This paper describes the activity of annotation of an Italian corpus of in-car speech material, with specific reference to the JavaSgram tool, developed with the purpose of annotating multichannel speech corpora. Some pre/post processing tools used with JavaSgram are briefly described together with a synthetic description of the annotation criteria which were adopted. The final objective is tha...

Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools

Computational tools and methods employed in corpus linguistics are split into three main types: compilation, annotation and retrieval. These mirror and support the usual corpus linguistics methodology of corpus collection, manual and/or automatic tagging, followed by query and analysis. Typically, corpus software to support retrieval implements some or all of the five major methods in corpus li...

Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR

This project approaches the problem of language documentation and revitalization from a rather untraditional angle. To improve and facilitate language documentation of endangered languages, we attempt to use corpus linguistic methods and speech and language technologies to reduce the time needed for transcription and annotation of audio and video language recordings. The paper demonstrates this...

A Basic Language Resource Kit for Persian

Persian, with about 100,000,000 speakers worldwide, belongs to the group of languages with less developed linguistically annotated resources and tools. The few existing resources and tools are neither open source nor freely available. Thus, our goal is to develop open source resources such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian. We do this by exp...

LEA - Linguistic Exercises with Annotation Tools

In this paper we present LEA (Linguistic Exercises with Annotation tools). LEA is a new didactic concept helping students to become familiar with corpus linguistic methods and annotation tools. The main idea behind LEA is that classical linguistic exercises are being solved with annotation tools. We will present the advantages of this method (e.g. didactic benefits, automatic correction) and de...

Journal:
  • Speech Communication

Volume 33

Publication date: 2001